Collect Delta extended statistics when creating table#15878
Conversation
|
rebase on master to use CI fix #15879 |
alexjo2144
left a comment
There was a problem hiding this comment.
Looks pretty good overall. Couple questions/nitpicks
There was a problem hiding this comment.
maybe just collectExtendedColumnStatisticsOnWrite ?
There was a problem hiding this comment.
Save the result of extractColumnMetadata so that you don't have to call it again at the bottom of this method.
| Set<String> allColumnNames = extractColumnMetadata(metadata, typeManager).stream() | |
| .map(ColumnMetadata::getName) | |
| .collect(toImmutableSet()); | |
| List<ColumnMetadata> columnMetadata = extractColumnMetadata(metadata, typeManager); | |
| Set<String> allColumnNames = columnMetadata.stream() | |
| .map(ColumnMetadata::getName) | |
| .collect(toImmutableSet()); |
There was a problem hiding this comment.
Per other comment, don't have to call extractColumnMetadata again.
There was a problem hiding this comment.
Why not includeMaxFileModifiedTime in this situation?
There was a problem hiding this comment.
Statistic aggregation during table creation does not have information about file_modified_time yet.
There was a problem hiding this comment.
Right, right. Then if the modified time isn't present we just use the current time when the collection is done. Makes sense.
There was a problem hiding this comment.
Can you please add a code comment explaining this consideration?
What do we need to have this information available?
There was a problem hiding this comment.
We should still test the old thing too
There was a problem hiding this comment.
Please do create a compatibility test with spark to verify that after a CTAS DESC EXTENDED works as intended on Databricks
There was a problem hiding this comment.
Nevermind. Trino Delta Lake (on the storage layer) & Databricks (on the metastore properties) have outputs in different places.
There was a problem hiding this comment.
Do consider documenting this new property in delta-lake.rst - either in this PR or a follow-up PR
There was a problem hiding this comment.
I would wait with documentation until other write operations are implemented if that's ok.
There was a problem hiding this comment.
nit: There's no need to change this line. I would revert.
Reduce map iterations and lookups to minimum, while also simplifying the code flow.
There was a problem hiding this comment.
nit: this kind of cosmetic changes can be done in a separate commit.
There was a problem hiding this comment.
This change make sense only with this commit as it allows collection to have 0 elements. It should throw exception before.
There was a problem hiding this comment.
Nevermind. Trino Delta Lake (on the storage layer) & Databricks (on the metastore properties) have outputs in different places.
There was a problem hiding this comment.
Can you please add a code comment explaining this consideration?
What do we need to have this information available?
There was a problem hiding this comment.
Time is added during statistics update.
Do you mean Maximum File modified time ?
There was a problem hiding this comment.
this line is now over line length limit, so --
we put all arguments on one line, or each on separate line
There was a problem hiding this comment.
that's minimal change, but that's not how you'd write the code if you were writing the code anew.
.flatMap(entry -> {
....
if (....) {
return Stream.of();
}
return Stream.of(Instant.ofEpochMilli(....));
})
.collect(toOptional());There was a problem hiding this comment.
It sounds like a problem and a workaround, but there isn't a problem
// File modified time does not need to be collected as a statistics because it gets derived directly from files being written
false);
There was a problem hiding this comment.
test_analyze_ -> test_ctats_stats_
There was a problem hiding this comment.
can you paste this method contents into testCreateTableAsStatistics above?
testCreateTableAsStatistics has good name and a javadoc, just the contents are worse
There was a problem hiding this comment.
i know it's preexisting but i don't think we need to assert split count in every test method here. It blurs the test's intent
(perhaps, we don't need it in any test, i don't know, but i am not requesting any change to existing tests)
this would be better:
assertUpdate("ANALYZE " + tableName);
|
@pajaks @findinpath @alexjo2144 thank you, this is awesome! |
In particular this improves Delta query performance on data sets created in the connector using CTAS.
Description
Collect delta lake statistics for CREATE TABLE AS.
Additional context and related issues
Release notes
(x) Release notes are required, with the following suggested text: